14 research outputs found
Guarded Policy Optimization with Imperfect Online Demonstrations
The Teacher-Student Framework (TSF) is a reinforcement learning setting where
a teacher agent guards the training of a student agent by intervening and
providing online demonstrations. Under the assumption of optimality, the
teacher policy has perfect timing and capability to intervene in the learning
process of the student agent, providing safety guarantees and exploration
guidance.
Nevertheless, in many real-world settings it is expensive or even impossible to
obtain a well-performing teacher policy. In this work, we relax the assumption
of a well-performing teacher and develop a new method that can incorporate
arbitrary teacher policies with modest or inferior performance. We instantiate
an Off-Policy Reinforcement Learning algorithm, termed Teacher-Student Shared
Control (TS2C), which incorporates teacher intervention based on
trajectory-based value estimation. Theoretical analysis validates that the
proposed TS2C algorithm attains efficient exploration and a substantial safety
guarantee regardless of the teacher's own performance. Experiments
on various continuous control tasks show that our method can exploit teacher
policies at different performance levels while maintaining a low training cost.
Moreover, the student policy surpasses the imperfect teacher policy, achieving
higher accumulated reward in held-out testing environments. Code is available
at https://metadriverse.github.io/TS2C. Comment: Accepted at ICLR 2023 (top 25%).
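As a rough illustration of the value-based intervention idea described above, the sketch below switches control to the teacher whenever a value estimate flags the student's action as substantially worse. All names (student, teacher, value_fn, threshold) are hypothetical, and this is a sketch of the general mechanism rather than the authors' released implementation.

```python
# Hypothetical sketch of value-based teacher intervention (not the TS2C code).
# Control passes to the teacher only when the value estimate of the student's
# action falls below that of the teacher's action by more than a threshold.

def shared_control_action(state, student, teacher, value_fn, threshold=0.1):
    """Return the action to execute and whether the teacher intervened."""
    a_student = student(state)
    a_teacher = teacher(state)
    v_student = value_fn(state, a_student)  # estimated value of student action
    v_teacher = value_fn(state, a_teacher)  # estimated value of teacher action
    if v_teacher - v_student > threshold:   # student looks substantially worse
        return a_teacher, True
    return a_student, False
```

Because intervention depends on estimated values rather than on trusting the teacher outright, an imperfect teacher only takes over where it is expected to help, which matches the paper's goal of exploiting teachers at different performance levels.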
MetaDrive: Composing Diverse Driving Scenarios for Generalizable Reinforcement Learning
Driving safely requires multiple capabilities from human and intelligent
agents, such as the generalizability to unseen environments, the safety
awareness of the surrounding traffic, and the decision-making in complex
multi-agent settings. Despite the great success of Reinforcement Learning (RL),
most RL research investigates each capability separately due to the lack of
integrated environments. In this work, we develop a new driving
simulation platform called MetaDrive to support the research of generalizable
reinforcement learning algorithms for machine autonomy. MetaDrive is highly
compositional: it can generate an infinite number of diverse driving
scenarios from both procedural generation and real data import.
Based on MetaDrive, we construct a variety of RL tasks and baselines in both
single-agent and multi-agent settings, including benchmarking generalizability
across unseen scenes, safe exploration, and learning multi-agent traffic. The
generalization experiments conducted on both procedurally generated scenarios
and real-world scenarios show that increasing the diversity and size of the
training set improves the generalizability of the RL agents.
We further evaluate various safe reinforcement learning and multi-agent
reinforcement learning algorithms in MetaDrive environments and provide
benchmarks. Source code, documentation, and demo video are available at
https://metadriverse.github.io/metadrive . More research projects based on
MetaDrive simulator are listed at https://metadriverse.github.io
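To make the compositional scenario generation above concrete, the snippet below sketches a gym-style interaction loop with MetaDrive. The class name MetaDriveEnv comes from the project, but the config keys and the reset/step signatures follow the newer gymnasium-style API and may differ between MetaDrive versions, so treat this as an assumption-laden sketch.

```python
# Sketch of a random-policy rollout over procedurally generated scenarios.
# Config keys (num_scenarios, start_seed) are assumed from the MetaDrive docs
# and may vary across releases.
from metadrive import MetaDriveEnv

env = MetaDriveEnv(dict(
    num_scenarios=100,  # how many procedurally generated maps to sample from
    start_seed=0,       # seed offset selecting which scenarios are generated
))

obs, info = env.reset()
for _ in range(1000):
    action = env.action_space.sample()  # placeholder for a trained RL policy
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```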
PrefRec: Recommender Systems with Human Preferences for Reinforcing Long-term User Engagement
Current advances in recommender systems have been remarkably successful in
optimizing immediate engagement. However, long-term user engagement, a more
desirable performance metric, remains difficult to improve. Meanwhile, recent
reinforcement learning (RL) algorithms have shown their effectiveness in a
variety of long-term goal optimization tasks. For this reason, RL is widely
considered a promising framework for optimizing long-term user engagement in
recommendation. Though promising, the application of RL heavily relies on
well-designed rewards, but designing rewards related to long-term user
engagement is quite difficult. To mitigate the problem, we propose a novel
paradigm, recommender systems with human preferences (or Preference-based
Recommender systems, PrefRec), which allows RL recommender systems to learn from
preferences about users' historical behaviors rather than explicitly defined
rewards. Such preferences are easily accessible through techniques such as
crowdsourcing, as they do not require any expert knowledge. With PrefRec, we
can fully exploit the advantages of RL in optimizing long-term goals, while
avoiding complex reward engineering. PrefRec uses the preferences to
automatically train a reward function in an end-to-end manner. The reward
function is then used to generate learning signals to train the recommendation
policy. Furthermore, we design an effective optimization method for PrefRec,
which uses an additional value function, expectile regression, and reward model
pre-training to improve performance. We conduct experiments on a variety of
long-term user engagement optimization tasks. The results show that PrefRec
significantly outperforms previous state-of-the-art methods in all the tasks.
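The abstract's core mechanism, training a reward function end-to-end from preferences, is commonly implemented in preference-based RL with a Bradley-Terry style loss over trajectory pairs; the sketch below assumes that formulation for illustration and is not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward model: maps each observation to a scalar reward and sums
    rewards over a trajectory to score it."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, traj):            # traj: (T, obs_dim)
        return self.net(traj).sum()     # scalar trajectory return

def preference_loss(model, traj_a, traj_b, label):
    """Bradley-Terry loss; label = 1.0 if traj_a is preferred, else 0.0."""
    r_a, r_b = model(traj_a), model(traj_b)
    p_a = torch.sigmoid(r_a - r_b)      # P(traj_a preferred)
    return -(label * torch.log(p_a) + (1 - label) * torch.log(1 - p_a))
```

Once trained this way, the reward model supplies the learning signal for the recommendation policy, replacing the hand-designed reward that is hard to specify for long-term engagement.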
State Regularized Policy Optimization on Data with Dynamics Shift
In many real-world scenarios, Reinforcement Learning (RL) algorithms are
trained on data with dynamics shift, i.e., with different underlying
environment dynamics. Most current methods address this issue by
training context encoders to identify environment parameters. Data with
dynamics shift are separated according to their environment parameters to train
the corresponding policy. However, these methods can be sample inefficient as
data are used ad hoc, and policies trained for one dynamics cannot
benefit from data collected in other environments with different dynamics.
In this paper, we find that in many environments with similar structures and
different dynamics, optimal policies have similar stationary state
distributions. We exploit this property and learn the stationary state
distribution from data with dynamics shift for efficient data reuse. The learned
distribution is used to regularize the policy trained in a new environment,
leading to the SRPO (State Regularized Policy Optimization) algorithm. To
conduct theoretical analyses, the
intuition of similar environment structures is characterized by the notion of
homomorphous MDPs. We then demonstrate a lower-bound performance guarantee on
policies regularized by the stationary state distribution. In practice, SRPO
can be an add-on module to context-based algorithms in both online and offline
RL settings. Experimental results show that SRPO can make several context-based
algorithms far more data efficient and significantly improve their overall
performance. Comment: Preprint. Under review.
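One way to read the regularization described above, sketched under assumed names: penalize the policy whenever the states it visits have low probability under the stationary state distribution learned from the shifted-dynamics data. The density model interface and the weighting coefficient below are illustrative assumptions, not the paper's exact objective.

```python
# Sketch: add a stationary-state-distribution regularizer to an RL policy loss.
# state_log_density(states) -> per-state log-probability under a density model
# fitted to states collected from environments with different dynamics.

def srpo_style_loss(policy_loss, states, state_log_density, coef=0.1):
    log_p = state_log_density(states)   # (batch,) log-densities
    reg = -log_p.mean()                 # low-density states incur high penalty
    return policy_loss + coef * reg     # coef trades off reward vs. regularity
```

Because only the loss changes, a term like this can sit on top of an existing context-based algorithm, matching the paper's description of SRPO as an add-on module.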
AdaRec: Adaptive Sequential Recommendation for Reinforcing Long-term User Engagement
Growing attention has been paid to Reinforcement Learning (RL) algorithms
for optimizing long-term user engagement in sequential recommendation tasks.
One challenge in large-scale online recommendation systems is the constant,
complicated change in users' behavior patterns, such as interaction rates and
retention tendencies. When formulated as a Markov Decision Process (MDP), the
dynamics and reward functions of the recommendation system are continuously
affected by these changes. Existing RL algorithms for recommendation systems
will suffer from distribution shift and struggle to adapt to such an MDP. In
this paper, we introduce a novel paradigm called Adaptive Sequential
Recommendation (AdaRec) to address this issue. AdaRec proposes a new
distance-based representation loss to extract latent information from users'
interaction trajectories. Such information reflects how well the RL policy fits
current user behavior patterns and helps the policy identify subtle changes
in the recommendation system. To adapt rapidly to these changes, AdaRec
encourages exploration with the idea of optimism under uncertainty. The
exploration is further guarded by zero-order action optimization to ensure
stable recommendation quality in complicated environments. We conduct extensive
empirical analyses in both simulator-based and live sequential recommendation
tasks, where AdaRec exhibits superior long-term performance compared to all
baseline algorithms. Comment: Preprint. Under review.
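The phrase "optimism under uncertainty" above is commonly realized with an ensemble of value networks, scoring actions by their mean value plus an uncertainty bonus; the sketch below assumes that standard construction for illustration and is not necessarily AdaRec's exact mechanism.

```python
import torch

def optimistic_scores(q_ensemble, state, actions, beta=1.0):
    """Score candidate actions optimistically with a Q-ensemble.

    q_ensemble: list of Q-networks, each mapping (state, actions) to a
    (num_actions,) tensor of values. Higher ensemble disagreement (std)
    yields a larger exploration bonus.
    """
    qs = torch.stack([q(state, actions) for q in q_ensemble])  # (K, A)
    mean, std = qs.mean(dim=0), qs.std(dim=0)
    return mean + beta * std  # optimism: prefer uncertain, promising actions
```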
Gene Expression Profiles Deciphering Rice Phenotypic Variation between Nipponbare (Japonica) and 93-11 (Indica) during Oxidative Stress
Rice is a very important food staple that feeds more than half the world's population. Two major Asian cultivated rice (Oryza sativa L.) subspecies, japonica and indica, show significant phenotypic variation in their stress responses. However, the molecular mechanisms underlying this phenotypic variation are still largely unknown. A common link among different stresses is that they produce an oxidative burst and result in an increase of reactive oxygen species (ROS). In this study, methyl viologen (MV) was applied as a ROS agent to investigate the rice oxidative stress response. We observed that 93-11 (indica) seedlings exhibited leaf senescence with severe lesions under MV treatment compared to Nipponbare (japonica). Whole-genome microarray experiments were conducted, and 1,062 probe sets were identified with gene expression level polymorphisms between the two rice cultivars in addition to differential expression under MV treatment; these were designated Core Intersectional Probesets (CIPs). The CIPs were analyzed by Gene Ontology (GO) and showed enriched GO terms related to toxin and oxidative stress responses as well as other responses. The GO term-enriched genes of the CIPs include glutathione S-transferases (GSTs), P450s, plant defense genes, and secondary metabolism related genes such as chalcone synthase (CHS). Further insertion/deletion (InDel) and regulatory element analyses of these identified CIPs suggested that there may be some eQTL hotspots related to oxidative stress in the rice genome, such as the GST genes encoded on chromosome 10. In addition, we identified a group of marker genes distinguishing the japonica and indica subspecies. In summary, we developed a new strategy combining biological experiments and data mining to study the possible molecular mechanisms of phenotypic variation during oxidative stress between Nipponbare and 93-11. This study will aid in the analysis of the molecular basis of quantitative traits.
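The CIP selection logic above is, at heart, an intersection of two filters; a small data-mining sketch is given below, with column names and fold-change/p-value thresholds as purely hypothetical placeholders for whatever criteria the microarray analysis actually used.

```python
import pandas as pd

def select_cips(df, poly_col="log2fc_cultivar", mv_col="log2fc_mv",
                p_col="adj_pval", fc=1.0, alpha=0.05):
    """Keep probe sets that are both polymorphic between cultivars and
    differentially expressed under MV treatment (hypothetical criteria)."""
    polymorphic = df[poly_col].abs() >= fc
    de_under_mv = (df[mv_col].abs() >= fc) & (df[p_col] < alpha)
    return df[polymorphic & de_under_mv]

# Usage (one row per probe set):
# df = pd.read_csv("microarray_results.csv")
# cips = select_cips(df)
```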
Reliability and validity of the international dementia alliance schedule for the assessment and staging of care in China
Abstract Background Clinical and social services are both important for dementia care. The International Dementia Alliance (IDEAL) Schedule for the Assessment and Staging of Care was developed to guide clinical and social care for dementia. Our study aimed to assess the validity and reliability of the IDEAL schedule in China. Methods Two hundred eighty-two dementia patients and their caregivers were recruited from 15 hospitals in China. Each patient-caregiver dyad was assessed with the IDEAL schedule by a rater and an observer simultaneously. The Clinical Dementia Rating (CDR), Mini-Mental State Examination (MMSE), and Caregiver Burden Inventory (CBI) were assessed for criterion validity. A repeat IDEAL assessment was conducted 7-10 days after the initial interview for 62 dyads. Results Two hundred seventy-seven patient-caregiver dyads completed the IDEAL assessment. Inter-rater reliability for the total score of the IDEAL schedule was 0.93 (95% CI = 0.92-0.95). The intraclass correlation coefficient for the total score of the IDEAL was 0.95 for the interviewers and 0.93 for the silent raters. The IDEAL total score correlated with the global CDR score (ρ = 0.72, p < 0.001), the CDR sum of boxes (CDR-SOB; ρ = 0.74, p < 0.001), the MMSE total score (ρ = -0.65, p < 0.001), and the CBI (ρ = 0.70, p < 0.001). All item scores of the IDEAL schedule were associated with the CDR-SOB (ρ = 0.17-0.79, all p < 0.05). Conclusion The IDEAL schedule is a valid and reliable tool for the staging of care for dementia in the Chinese population.
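For readers who want to reproduce the criterion-validity statistics above on their own data, the sketch below uses SciPy's Spearman correlation, assuming the reported coefficients are Spearman's ρ; the toy numbers are placeholders, not study data.

```python
from scipy.stats import spearmanr

# Toy per-dyad scores (placeholders, not the study's data).
ideal_total = [18, 25, 31, 12, 27]          # IDEAL total score per dyad
cdr_sob     = [4.0, 9.5, 12.0, 2.5, 10.0]   # CDR sum of boxes per dyad

rho, p_value = spearmanr(ideal_total, cdr_sob)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3g}")
```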
Additional file 3: Table S2. of Reliability and validity of the international dementia alliance schedule for the assessment and staging of care in China
Correlation of item scores of IDEAL, Chinese version, against factor scores of CBI. (DOCX 16 kb)
Additional file 1: of Reliability and validity of the international dementia alliance schedule for the assessment and staging of care in China
Study groups, raters and participating hospitals (in alphabetic order by province or administrative city). (DOCX 15 kb)